The Performance Paradox states that a mathematically perfect kernel (e.g., $out = x + y$) may actually perform worse than a CPU loop if it cannot amortize the GPU hardware's fixed costs. This typically shows up as launch overhead.
1. The "Correctness" Fallacy
Functional correctness is not the same as efficiency. Your Triton code may correctly distribute the task across thousands of threads, but if the total workload (N) is small, the GPU is severely underutilized: the hardware spends far more time switching state than performing actual arithmetic.
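The imbalance between arithmetic and data movement can be made concrete with standard back-of-envelope FLOP and byte counts for float32 (this is a sketch of the usual accounting, not profiler output):

```python
def vector_add_intensity(n: int) -> float:
    """FLOPs per byte for out = x + y on n float32 elements."""
    flops = n                    # one addition per element
    bytes_moved = 3 * 4 * n      # two loads + one store, 4 bytes each
    return flops / bytes_moved   # 1/12 FLOP/byte, regardless of n

def matmul_intensity(n: int) -> float:
    """FLOPs per byte for a dense n x n float32 matrix multiply."""
    flops = 2 * n ** 3           # n^2 outputs, each a length-n dot product
    bytes_moved = 3 * 4 * n ** 2 # read A and B, write C
    return flops / bytes_moved   # n/6, grows with problem size
```

Vector addition stays at 1/12 FLOP per byte no matter how large N gets, so it can only ever be bandwidth-bound; a dense matrix multiply's intensity grows linearly with n, which is why it eventually becomes compute-bound.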
2. The Python Measurement Trap
Benchmarking GPU code from Python with time.time() is risky. GPU calls are asynchronous: Python merely enqueues the command and moves on. Without torch.cuda.synchronize(), you are measuring enqueue time. With synchronization, what you measure is dominated by host-to-device latency, which is often 10x longer than the kernel execution itself.
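The pitfall can be avoided with a small timing helper. This is a sketch, not a library API: `benchmark` and its parameters are hypothetical names, and `synchronize` is whatever barrier your framework provides (for PyTorch on CUDA, that would be torch.cuda.synchronize):

```python
import time

def benchmark(fn, synchronize=None, warmup=3, iters=10):
    """Average wall time of fn(), with optional device barriers.

    Without `synchronize`, an asynchronous launch returns immediately,
    so this would time the enqueue, not the kernel.
    """
    for _ in range(warmup):
        fn()                      # warm up caches / JIT compilation
    if synchronize is not None:
        synchronize()             # drain any queued warmup work
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if synchronize is not None:
        synchronize()             # wait for the timed work to finish
    return (time.perf_counter() - start) / iters
```

A hypothetical call site would look like `benchmark(lambda: add_kernel[grid](x, y, out, N), synchronize=torch.cuda.synchronize)`, where `add_kernel` and `grid` stand in for your own Triton kernel and launch grid.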
3. Latency vs. Throughput
To overcome the paradox, you must supply enough work to "hide" the launch latency. This is precisely the shift from a latency-bound regime (limited by the CPU-GPU bus) to a throughput-bound regime (limited by GPU memory or compute).
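A toy cost model shows the transition between the two regimes. The 43 µs launch overhead and 900 GB/s bandwidth below are illustrative assumptions, not measurements:

```python
LAUNCH_OVERHEAD_US = 43.0   # assumed fixed cost per launch, in microseconds
BANDWIDTH_GB_S = 900.0      # assumed sustained HBM bandwidth

def vector_add_total_us(n: int) -> float:
    """Modeled wall time for out = x + y on n float32 elements."""
    bytes_moved = 3 * 4 * n                           # 2 loads + 1 store
    kernel_us = bytes_moved / (BANDWIDTH_GB_S * 1e3)  # GB/s -> bytes/us
    return LAUNCH_OVERHEAD_US + kernel_us

def overhead_fraction(n: int) -> float:
    """Share of total time spent on the fixed launch cost."""
    return LAUNCH_OVERHEAD_US / vector_add_total_us(n)
```

Under these assumptions, at N=256 the fixed cost is essentially 100% of the runtime (latency-bound), while at N=10^8 it falls to a few percent (throughput-bound).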
QUESTION 1
For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).
N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch
N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic
N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch
All are compute-bound.
✅ Correct!
At very small N, launch overhead dominates. Large vector adds are memory-bandwidth limited. Dense matrix multiplications have high arithmetic intensity and become compute-bound.
❌ Incorrect
Think about the ratio of math to data movement, and the constant cost of starting a kernel.
QUESTION 2
In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?
Arithmetic Throughput
Memory Bandwidth
Register Pressure
L1 Cache Size
✅ Correct!
ReLU is memory-bound. It performs one very simple comparison (max(0, x)) for every load and store, resulting in extremely low arithmetic intensity.
❌ Incorrect
Does ReLU perform complex math, or does it spend most of its time moving data to and from HBM?
QUESTION 3
What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?
The GPU and CPU always finish at the same time.
The CPU continues to the next line of code before the GPU kernel finishes.
The kernel runs faster on smaller GPUs.
Memory transfers are blocked by compute.
✅ Correct!
This is why synchronization is required for accurate timing; otherwise, you just time how long it took to send the command.
❌ Incorrect
If the CPU waited for every GPU call, performance would be significantly worse due to constant idle cycles.
QUESTION 4
Why does $out = x + y$ exhibit low arithmetic intensity?
It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.
The addition operation is too complex for the ALUs.
It requires shared memory synchronization.
It only runs on one SM.
✅ Correct!
High-performance compute requires many FLOPs per byte moved. Vector add is the opposite, making it bandwidth-limited.
❌ Incorrect
Count the number of times you access memory (tl.load/tl.store) versus the number of math operations (+).
QUESTION 5
How can the 'Launch Tax' be amortized in a real-world application?
By calling the kernel more frequently with smaller data.
By increasing the workload per launch (e.g., larger N or batching).
By using 16-bit floats instead of 32-bit floats.
By disabling the L2 cache.
✅ Correct!
Increasing the workload makes the fixed overhead a smaller percentage of the total execution time.
❌ Incorrect
Smaller data sizes actually make the launch tax more prominent relative to the useful work.
Case Study: The Overhead Audit
Interpreting Host vs. Device Benchmarks
A developer runs a Triton kernel for Vector Addition on 512 elements. They measure 45 microseconds using Python's `time.time()`. When profiling the same kernel using NVIDIA Nsight Systems, the actual GPU duration is reported as only 2.1 microseconds.
Q
1. What is the approximate 'Launch Tax' in microseconds for this scenario, and what percentage of the total measured time does it represent?
Solution:
The Launch Tax is approximately 42.9 microseconds (45 µs total - 2.1 µs of GPU work). This represents ~95.3% of the total measured time, indicating the application is heavily bound by system overhead rather than computation.
Q
2. If the developer increases N to 1,000,000 elements, assuming the kernel now takes 150 microseconds on the GPU, how does the Launch Tax impact the overall efficiency?
Solution:
With a constant launch overhead of ~43us, the total time would be ~193us. The overhead now only accounts for ~22.3% of the time. Efficiency improves as N increases because the fixed cost is spread over a much larger volume of compute/memory work.
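The arithmetic behind both answers can be checked in a few lines, using the numbers from the scenario (the solution's ~22.3% comes from rounding the overhead up to 43 µs; the unrounded figure lands at ~22.2%):

```python
host_us, device_us = 45.0, 2.1    # time.time() vs Nsight-reported duration
launch_tax = host_us - device_us  # fixed overhead: 42.9 us

small_share = launch_tax / host_us        # share of total at N=512
big_total = launch_tax + 150.0            # modeled total at N=1,000,000
big_share = launch_tax / big_total        # share of total at N=1,000,000

print(f"launch tax: {launch_tax:.1f} us "
      f"({small_share:.1%} of total at N=512, "
      f"{big_share:.1%} at N=1,000,000)")
```

The fixed cost does not shrink as N grows; it simply becomes a smaller slice of a much larger total, which is the whole point of amortization.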